{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Forward propagation:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "```python\n",
    "Z1 = X.W1 + b1\n",
    "A1 = ReLU(Z1)  \n",
    "Z2 = A1.W2 + b2\n",
    "exp_scores = exp(Z2)  \n",
    "probs = exp_scores / sum(exp_scores)\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Backward propagation:\n",
    "\n",
    "```python \n",
    "delta3 = probs\n",
    "delta3[range(len(X)), y] -= 1\n",
    "dW2 = A1.T.dot(delta3)\n",
    "db2 = sum(delta3)\n",
    "delta2 = delta3.dot(W2.T) * (A1 > 0)\n",
    "dW1 = X.T.dot(delta2)\n",
    "db1 = sum(delta2)\n",
    "\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here:\n",
    "\n",
    "X is the input data matrix of shape (num_samples, input_size), W1 is the weight matrix connecting the input layer to the hidden layer of shape (input_size, hidden_size), b1 is the bias vector for the hidden layer of shape (hidden_size,), A1 is the output of the hidden layer (also known as the hidden representation) of shape (num_samples, hidden_size), W2 is the weight matrix connecting the hidden layer to the output layer of shape (hidden_size, output_size), b2 is the bias vector for the output layer of shape (output_size,), Z2 is the weighted sum of the hidden layer output, exp_scores is the exponential of the output layer weighted sum, probs is the output probability for each class, and y is the true label vector of shape (num_samples,).\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "class TwoLayerNet:\n",
    "    def __init__(self, input_size, hidden_size, output_size):\n",
    "        self.params = {}\n",
    "        self.params['W1'] = np.random.randn(input_size, hidden_size)\n",
    "        self.params['b1'] = np.zeros(hidden_size)\n",
    "        self.params['W2'] = np.random.randn(hidden_size, output_size)\n",
    "        self.params['b2'] = np.zeros(output_size)\n",
    "\n",
    "    def forward(self, X):\n",
    "        W1, b1 = self.params['W1'], self.params['b1']\n",
    "        W2, b2 = self.params['W2'], self.params['b2']\n",
    "        z1 = np.dot(X, W1) + b1\n",
    "        a1 = np.maximum(0, z1) # ReLU activation function\n",
    "        z2 = np.dot(a1, W2) + b2\n",
    "        # probs = 1 / (1 + np.exp(-z2)) # Sigmoid activation function\n",
    "        exp_z = np.exp(z2)\n",
    "        probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)\n",
    "        return probs\n",
    "\n",
    "    def loss(self, X, y):\n",
    "        probs = self.forward(X)\n",
    "        correct_logprobs = -np.log(probs[range(len(X)), y])\n",
    "        data_loss = np.sum(correct_logprobs)\n",
    "        return 1.0/len(X) * data_loss\n",
    "\n",
    "    def train(self, X, y, num_epochs, learning_rate=0.1):\n",
    "        for epoch in range(num_epochs):\n",
    "            # Forward propagation\n",
    "            z1 = np.dot(X, self.params['W1']) + self.params['b1']\n",
    "            a1 = np.maximum(0, z1) # ReLU activation function\n",
    "            z2 = np.dot(a1, self.params['W2']) + self.params['b2']\n",
    "            # probs = 1 / (1 + np.exp(-z2)) # Sigmoid activation function\n",
    "            exp_z = np.exp(z2)\n",
    "            probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)\n",
    "\n",
    "            # Backpropagation\n",
    "            delta3 = probs\n",
    "            delta3[range(len(X)), y] -= 1\n",
    "            dW2 = np.dot(a1.T, delta3)\n",
    "            db2 = np.sum(delta3, axis=0)\n",
    "            delta2 = np.dot(delta3, self.params['W2'].T) * (a1 > 0) # derivative of ReLU\n",
    "            dW1 = np.dot(X.T, delta2)\n",
    "            db1 = np.sum(delta2, axis=0)\n",
    "\n",
    "            # Update parameters\n",
    "            self.params['W1'] -= learning_rate * dW1\n",
    "            self.params['b1'] -= learning_rate * db1\n",
    "            self.params['W2'] -= learning_rate * dW2\n",
    "            self.params['b2'] -= learning_rate * db2\n",
    "\n",
    "            # Print loss for monitoring training progress\n",
    "            if epoch % 100 == 0:\n",
    "                loss = self.loss(X, y)\n",
    "                print(\"Epoch {}: loss = {}\".format(epoch, loss))\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code defines a TwoLayerNet class with an initializer that takes the input size, hidden size, and output size as arguments. The weights and biases for the two layers are initialized randomly in this function.\n",
    "\n",
    "The forward function takes an input X and performs the forward propagation to calculate the output probabilities for each class.\n",
    "\n",
    "The loss function calculates the cross-entropy loss between the predicted probabilities and the true labels y.\n",
    "\n",
    "The train function performs the backpropagation to update the weights and biases based on the calculated gradients. The number of training epochs and learning rate can be specified as arguments to this function."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's an example of how to use the TwoLayerNet class to train and test the network on a toy dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: loss = 0.8791617000548932\n",
      "Epoch 100: loss = 0.03272609589944909\n",
      "Epoch 200: loss = 0.010130354895034843\n",
      "Epoch 300: loss = 0.005517334222420798\n",
      "Epoch 400: loss = 0.0036701620853277555\n",
      "Epoch 500: loss = 0.002707635703438397\n",
      "Epoch 600: loss = 0.0021206045443387493\n",
      "Epoch 700: loss = 0.0017317523015295431\n",
      "Epoch 800: loss = 0.0014568091215886065\n",
      "Epoch 900: loss = 0.0012539964886349238\n",
      "Predictions:  [0 1 1 0]\n"
     ]
    }
   ],
   "source": [
    "# Generate a toy dataset\n",
    "X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n",
    "y = np.array([0, 1, 1, 0])\n",
    "\n",
    "# Initialize a neural network\n",
    "net = TwoLayerNet(input_size=2, hidden_size=10, output_size=2)\n",
    "\n",
    "# Train the neural network\n",
    "net.train(X, y, num_epochs=1000)\n",
    "\n",
    "# Test the neural network\n",
    "probs = net.forward(X)\n",
    "predictions = np.argmax(probs, axis=1)\n",
    "print(\"Predictions: \", predictions)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Improvements \n",
    "\n",
    "There are several ways to improve the implementation of a two-layer neural network with softmax. Here are a few suggestions:\n",
    "\n",
    "1. Weight initialization: The current implementation initializes the weights randomly using a Gaussian distribution. However, it is recommended to use other weight initialization methods such as Xavier or He initialization to improve convergence and avoid vanishing or exploding gradients. One possible implementation for Xavier initialization of the weights is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Xavier initialization\n",
    "self.params['W1'] = np.random.randn(input_size, hidden_size) / np.sqrt(input_size)\n",
    "self.params['W2'] = np.random.randn(hidden_size, output_size) / np.sqrt(hidden_size)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Learning rate decay: The learning rate is a hyperparameter that determines the step size at each iteration during training. However, using a fixed learning rate may lead to suboptimal performance or slow convergence. A common technique is to gradually decrease the learning rate over time, known as learning rate decay, to fine-tune the network weights as the optimization process progresses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Learning rate decay\n",
    "learning_rate = 0.1\n",
    "lr_decay = 0.95\n",
    "lr_decay_epoch = 100\n",
    "for epoch in range(num_epochs):\n",
    "    # ...\n",
    "    if epoch % lr_decay_epoch == 0:\n",
    "        learning_rate *= lr_decay"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Regularization: Overfitting can occur when the model is too complex and the training data is limited. Regularization techniques such as L1 or L2 regularization can be applied to the loss function to prevent overfitting and improve the generalization performance of the model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# L2 regularization\n",
    "reg_lambda = 0.1\n",
    "data_loss += 0.5 * reg_lambda * (np.sum(self.params['W1'] ** 2) + np.sum(self.params['W2'] ** 2))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. Mini-batch training: The current implementation updates the weights using the entire training set at each iteration, which can be computationally expensive for large datasets. An alternative is to use mini-batch training, where a random subset of the training data is used at each iteration to update the weights. This can speed up the training process and improve convergence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mini-batch training\n",
    "batch_size = 64\n",
    "num_batches = len(X) // batch_size\n",
    "for epoch in range(num_epochs):\n",
    "    for i in range(num_batches):\n",
    "        # Select a random batch of data\n",
    "        batch_mask = np.random.choice(len(X), batch_size)\n",
    "        X_batch = X[batch_mask]\n",
    "        y_batch = y[batch_mask]\n",
    "\n",
    "        # Forward and backward propagation using the batch data\n",
    "        # ...\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5. Optimization algorithm: The current implementation uses stochastic gradient descent (SGD) as the optimization algorithm. However, there are other optimization algorithms such as Adam, Adagrad, and RMSprop that can improve the convergence speed and performance of the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Adam optimization\n",
    "beta1, beta2 = 0.9, 0.999\n",
    "eps = 1e-8\n",
    "mW1, vW1 = 0, 0\n",
    "mW2, vW2 = 0, 0\n",
    "for epoch in range(num_epochs):\n",
    "    # Forward and backward propagation\n",
    "    # ...\n",
    "    # Update parameters using Adam optimization\n",
    "    mW1 = beta1 * mW1 + (1 - beta1) * dW1\n",
    "    vW1 = beta2 * vW1 + (1 - beta2) * (dW1 ** 2)\n",
    "    mW2 = beta1 * mW2 + (1 - beta1) * dW2\n",
    "    vW2 = beta2 * vW2 + (1 - beta2) * (dW2 ** 2)\n",
    "    self.params['W1'] -= learning_rate * mW1 / (np.sqrt(vW1) + eps)\n",
    "    self.params['b1'] -= learning_rate * db1\n",
    "    self.params['W2'] -= learning_rate * mW2 / (np.sqrt(vW2) + eps)\n",
    "    self.params['b2'] -= learning_rate * db2\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Other extensions: "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "* Arbitrary activation function \n",
    "* Arbitrary loss function \n",
    "* Extension to multiple layers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import numpy as np\n",
    "\n",
    "class ActivationFunction:\n",
    "    def __init__(self):\n",
    "        pass\n",
    "\n",
    "    def __call__(self, x):\n",
    "        raise NotImplementedError\n",
    "\n",
    "    def derivative(self, x):\n",
    "        raise NotImplementedError\n",
    "\n",
    "class ReLU(ActivationFunction):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "\n",
    "    def __call__(self, x):\n",
    "        return np.maximum(0, x)\n",
    "\n",
    "    def derivative(self, x):\n",
    "        return (x > 0).astype(float)\n",
    "\n",
    "class Softmax(ActivationFunction):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "\n",
    "    def __call__(self, x):\n",
    "        exp_scores = np.exp(x - np.max(x, axis=1, keepdims=True))\n",
    "        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)\n",
    "        return probs\n",
    "\n",
    "    def derivative(self, x):\n",
    "        raise NotImplementedError\n",
    "\n",
    "class MultiLayerNet:\n",
    "    def __init__(self, input_size, hidden_sizes, output_size, activation_function, loss_function, reg_lambda=0.0):\n",
    "        self.params = {}\n",
    "        self.num_layers = 1 + len(hidden_sizes)\n",
    "        self.layer_sizes = [input_size] + hidden_sizes + [output_size]\n",
    "\n",
    "        for i in range(1, self.num_layers + 1):\n",
    "            self.params[f'W{i}'] = np.random.randn(self.layer_sizes[i-1], self.layer_sizes[i]) / np.sqrt(self.layer_sizes[i-1])\n",
    "            self.params[f'b{i}'] = np.zeros(self.layer_sizes[i])\n",
    "\n",
    "        self.activation_function = activation_function\n",
    "        self.activation_function_derivatives = [activation_function.derivative for _ in range(self.num_layers)]\n",
    "        self.loss_function = loss_function\n",
    "        self.reg_lambda = reg_lambda\n",
    "\n",
    "    def forward(self, X):\n",
    "        layer_output = X\n",
    "        self.layer_inputs = []\n",
    "        self.layer_outputs = [X]\n",
    "\n",
    "        for i in range(1, self.num_layers + 1):\n",
    "            W, b = self.params[f'W{i}'], self.params[f'b{i}']\n",
    "            layer_input = np.dot(layer_output, W) + b\n",
    "            self.layer_inputs.append(layer_input)\n",
    "            layer_output = self.activation_function(layer_input)\n",
    "            self.layer_outputs.append(layer_output)\n",
    "\n",
    "        return layer_output\n",
    "\n",
    "    def backward(self, X, y, output):\n",
    "        delta = output - y\n",
    "        dW = {}\n",
    "        db = {}\n",
    "        delta = delta / X.shape[0]\n",
    "\n",
    "        for i in reversed(range(1, self.num_layers + 1)):\n",
    "            layer_input = self.layer_inputs[i-1]\n",
    "            activation_derivative = self.activation_function_derivatives[i-1](layer_input)\n",
    "            dW[f'W{i}'] = np.dot(self.layer_outputs[i-1].T, delta) + self.reg_lambda * self.params[f'W{i}']\n",
    "            db[f'b{i}'] = np.sum(delta, axis=0)\n",
    "            delta = np.dot(delta, self.params[f'W{i}'].T) * activation_derivative\n",
    "\n",
    "        return dW, db\n",
    "\n",
    "    def loss(self, X, y, output):\n",
    "        data_loss = self.loss_function(output, y)\n",
    "        reg_loss = 0.0\n",
    "\n",
    "        for i in range(1, self.num_layers + 1):\n",
    "            reg_loss += 0.5 * self.reg_lambda * np.sum(self.params[f'W{i}'] ** 2)\n",
    "\n",
    "        total_loss = data_loss + reg_loss\n",
    "        return total_loss\n",
    "\n",
    "    def train(self, X, y, num_epochs, learning_rate=0.1):\n",
    "        for epoch in range(num_epochs):\n",
    "            # Forward propagation\n",
    "            output = self.forward(X)\n",
    "\n",
    "            # Backward propagation\n",
    "            dW, db = self.backward(X, y, output)\n",
    "\n",
    "            # Update parameters\n",
    "            for i in range(1, self.num_layers + 1):\n",
    "                self.params[f'W{i}'] -= learning_rate * dW[f'W{i}']\n",
    "                self.params[f'b{i}'] -= learning_rate * db[f'b{i}']\n",
    "\n",
    "            # Print loss for monitoring training progress\n",
    "            if epoch % 100 == 0:\n",
    "                loss = self.loss(X, y, output)\n",
    "                print(f\"Epoch {epoch}, loss: {loss}\")\n",
    "\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Generate a toy classification dataset\n",
    "X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)\n",
    "\n",
    "# Split the dataset into training and testing sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "\n",
    "# Normalize the input data\n",
    "mean = X_train.mean(axis=0)\n",
    "std = X_train.std(axis=0)\n",
    "X_train = (X_train - mean) / std\n",
    "X_test = (X_test - mean) / std\n",
    "\n",
    "\n",
    "# Define the mean squared error loss function\n",
    "def mse_loss(output, y):\n",
    "    return np.mean((output - y) ** 2)\n",
    "\n",
    "# Create a multi-layer neural network with 2 hidden layers\n",
    "net = MultiLayerNet(input_size=10, hidden_sizes=[20, 10], output_size=1,\n",
    "                    activation_function=Sigmoid(), loss_function=mse_loss, reg_lambda=0.01)\n",
    "\n",
    "# Train the network for 1000 epochs\n",
    "net.train(X_train, y_train, num_epochs=1000, learning_rate=0.01)\n",
    "\n",
    "# Evaluate the trained network on the test set\n",
    "output = net.forward(X_test)\n",
    "predicted_classes = np.round(output)\n",
    "accuracy = np.mean(predicted_classes == y_test)\n",
    "print(f\"Test accuracy: {accuracy}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
